Photonics-Optimized Processor System

ABSTRACT

A photonics-optimized multi-processor system may include a plurality of processor chips, each of the processor chips comprising at least one input/output (I/O) component. The multi-processor system may also include first and second photonic components. The at least one I/O component of at least one of the processor chips may be configured to directly drive the first photonic component and receive a signal from the second photonic component. A total latency from any one of the processor chips to data at any global memory location may not be dominated by a round trip speed-of-light propagation delay. A number of the processor chips may be at least 10,000, and the processor chips may be packaged into a total volume of no more than 8 m³. A density of the processor chips may be greater than 1,000 chips per cubic meter.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the priority benefit of U.S. Provisional Patent Application No. 62/151,924, filed on Apr. 23, 2015, which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present disclosure generally relates to photonics and, in particular, to a photonics-optimized processor.

BACKGROUND

Unless otherwise indicated herein, approaches described in this section are not prior art to the claims listed below and are not admitted to be prior art by inclusion in this section.

Processor micro-architecture and performance are strong functions of the technology used for communication among chips in a multi-chip system. The “Von Neumann bottleneck” is the path between the central processing unit (CPU) and the memory. If the bandwidth of this path is less than the CPU requirements, performance will be negatively impacted. The performance will also be negatively impacted if latency exceeds the CPU requirements. In state-of-the-art parallel processing systems, each processor chip contains multiple processing elements, or cores, has links to other processor chips, and has links to one or more memory chips.

SUMMARY

The following summary is for illustrative purposes only and is not intended to be limiting in any way. That is, the following summary is provided to introduce concepts, highlights, benefits and advantages of the novel and non-obvious techniques described herein. Select implementations are further described below in the detailed description. Thus, the following summary is not intended to identify essential features of the claimed subject matter, nor is it intended for use in determining the scope of the claimed subject matter.

In one aspect, a photonics-optimized multi-processor system may include a plurality of processor chips, each of the processor chips comprising at least one input/output (I/O) component. The multi-processor system may also include first and second photonic components. The at least one I/O component of at least one of the processor chips may be configured to directly drive the first photonic component and receive a signal from the second photonic component. A signal may be either analog or digital. One purpose of the optimization of the processor chip is to minimize the total latency from any one of the processor cores to data at any global memory location in a multi-chip system. Another purpose of the optimization is to increase bandwidth from any one of the processor cores to data at any global memory location in a multi-chip system. The optimization also seeks to maximize the performance of a processor chip by providing enough memory bandwidth for the maximum number of cores and/or other processing elements on a processor chip. The optimization further seeks to maximize performance for a given set of constraints. Full optimization of a photonics-optimized processor (POP) for a supercomputer must be accompanied by optimization of the system design to minimize the length of the inter-processor interconnect (IPI) physical media, such as optical fibers. A measure of the success of the optimization may be that the total latency from any one of the processor chips to data at any global memory location may not be dominated by a round trip speed-of-light propagation delay. Success may be achieved for a supercomputer configuration if the number of the processor chips is at least 10,000 and the processor chips, along with off-chip memory devices, are packaged into a total volume of no more than 8 m³. A density of the processor chips may be greater than 1,000 chips per cubic meter.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of the present disclosure. The drawings illustrate implementations of the disclosure and, together with the description, serve to explain the principles of the disclosure. It is appreciable that the drawings are not necessarily to scale, as some components may be shown out of proportion to their size in an actual implementation in order to clearly illustrate the concept of the present disclosure.

FIG. 1 is a schematic diagram of a photonics-optimized multi-processor system in accordance with the present disclosure.

FIG. 2 is a schematic diagram of another photonics-optimized multi-processor system in accordance with the present disclosure.

FIG. 3 is a schematic diagram of still another photonics-optimized multi-processor system in accordance with the present disclosure.

FIG. 4 is a schematic diagram of yet another photonics-optimized multi-processor system in accordance with the present disclosure.

FIG. 5 is a schematic diagram of a memory hierarchy for a photonics-optimized processor core in accordance with the present disclosure.

FIGS. 6A, 6B and 6C illustrate schematic diagrams of a processor-memory tile in accordance with the present disclosure.

FIG. 7 is a schematic diagram of a Compute Module in accordance with the present disclosure.

FIGS. 8A and 8B illustrate, respectively, a top view and a front view of a supercomputer Compute Unit in accordance with the present disclosure.

FIG. 9 is a schematic diagram of a photonics-optimized multi-processor multi-core system in accordance with the present disclosure.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Overview

In general, a large parallel system tends to have multiple processor chips. Each processor chip may have one or more types of processing elements (herein interchangeably referred to as “cores”), and each core needs to communicate with all of the other cores in the system. Each processor chip controls access to its local memory, either on-chip, e.g., cache, or off-chip, e.g., dynamic random access memory (DRAM). An on-chip interconnect (OCI) provides communication among cores, caches, interfaces to other processor chips, memory controllers, I/O, network controllers, and other devices on the same chip. The OCI may comprise links, a central crossbar switch, a network of switches, or other mechanisms. Cores often need to access remote memory, e.g., memory which is local to another processor chip. Thus, the Von Neumann processor-memory bottleneck also includes the paths among the processor chips, also known as the inter-processor interconnect (IPI). The bandwidth of the interface between the processor chip and the IPI is known as the injection bandwidth.

Over time, the requirements of a CPU may change according to the application software which is being executed. Computation-intensive applications primarily use internal registers of the CPU to store inputs and results of computations. When the CPU issues fetch instructions to get the inputs from memory and issues store instructions to put the results back to memory, the data is usually found in the cache. In this case, the bandwidth and latency of the path to the off-chip DRAM memory have comparatively little impact on performance. However, for data-intensive applications, the CPU usually needs to issue fetch instructions to get inputs from memory and issue store instructions to put the results back to memory. The data is usually not in the cache and an off-chip DRAM memory transaction is required. Accordingly, the processor performance decreases because the latency in accessing the memory causes the CPU to wait for the data inputs. Limited memory bandwidth reduces performance because the CPU must slow down until it is issuing memory transactions at the same rate that the memory can process them.
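
As a rough, illustrative calculation of the effect described above, the following sketch estimates how off-chip DRAM latency inflates the average cycles per instruction of a core; the clock rate, miss rate, and latency figures are assumptions chosen for illustration and are not taken from this disclosure.

    # Illustrative estimate of how DRAM latency throttles a core when data
    # is not found in the cache. All numbers are assumed, not disclosed values.
    core_clock_hz  = 2.0e9     # assumed 2 GHz core clock
    mem_instr_frac = 0.3       # assumed fraction of instructions that access memory
    miss_rate      = 0.2       # assumed fraction of those accesses that miss the cache
    dram_latency_s = 100e-9    # assumed 100 ns to complete an off-chip DRAM transaction

    # Average stall cycles added per instruction by off-chip DRAM accesses
    stall_cycles = mem_instr_frac * miss_rate * dram_latency_s * core_clock_hz
    effective_cpi = 1.0 + stall_cycles   # assume 1 cycle per instruction otherwise
    print(effective_cpi)                 # 13.0, i.e. the core runs about 13 times slower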

In certain applications, substantially all of the cores of a processor chip may be accessing remote memory, which implies that substantially all of the transactions accessing local memory originate from another processor chip. In this case, assuming uniform traffic, the injection bandwidth to the IPI for each processor chip must be the total of the aggregate bandwidth to memory of its cores plus the bandwidth of the chip to local memory to avoid congestion. The injection bandwidth to the IPI must be higher to avoid congestion if the traffic has ‘hot spots’ (higher than average traffic to and/or from a subset of the processor chips). Congestion increases latency of the memory accesses.
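
A minimal sketch of the sizing argument in the preceding paragraph, using assumed (not disclosed) per-core and local-memory bandwidth figures:

    # Injection bandwidth needed when substantially all cores access remote memory
    # and substantially all local-memory traffic originates from other chips,
    # assuming uniform traffic. Bandwidth figures are illustrative assumptions.
    cores_per_chip    = 32
    core_mem_bw_gbps  = 64      # assumed memory traffic generated per core, in Gb/s
    local_mem_bw_gbps = 2048    # assumed bandwidth of the chip to its local memory, in Gb/s

    injection_bw_gbps = cores_per_chip * core_mem_bw_gbps + local_mem_bw_gbps
    print(injection_bw_gbps)    # 4096 Gb/s, before any margin for hot spots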

Processor micro-architectures are designed and carefully optimized to achieve a particular level of performance with a minimum amount of resources to minimize cost. The system application defines the environment and application software characteristics, which, in turn, define the relative importance of parameters such as performance, memory, cost, size, weight, and power. These parameters are very different for various implementations such as, for example, cellphones and supercomputers.

Current processors are copper-optimized. The corresponding microarchitecture is typically constrained by the number of pins on the package and the characteristics of the connections to other chips in the system. Signals being transmitted through copper connections can be attenuated to the point that they cannot be received. Attenuation is a function of the frequency of the signal and the length of the connection. Thus, for a given length, the data rate is limited and the bandwidth on and off the chip is limited by the number of pins.

The number of pins on a chip package is limited by cost. For instance, a processor chip of a cellphone tends to have strict cost constraints and may have about 750 pins. A processor chip of a server may have about 2,000 pins. A processor chip of a mainframe, also known as an “enterprise system”, may have about 9,000 pins. Typically, about two-thirds of the package pins are used to provide power and ground connections to the chip. The remaining one-third of the pins are used for signal connections. The cost of the processor chip also increases with the number of pins since the areas of the input/output (I/O) driver and electrostatic discharge (ESD) protection circuits tend to be large.

The number of pins required by the microarchitecture is a function of the system design and performance requirements. For instance, a cellphone may have only one application processor chip, and so it may not require pins for the IPI. A server typically has two processor chips and may only need a relatively small number of pins for the IPI and the off-chip DRAM memory. The number of pins for the IPI increases with the number of processor chips in the system if the processor chip contains the IPI switches, which is often done to reduce latency. Latency increases with the interconnect diameter (number of “hops” between processors). More ports (higher radix) and the associated pins to the IPI may be added to reduce the diameter. However, this reduces the number of pins available for high memory bandwidth. A mainframe may have a large number of pins dedicated to memory bandwidth. The number of ports to the IPI is limited, so scaling may still be limited, in spite of the increase in total pins.
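
The radix-versus-diameter trade-off noted above can be illustrated with a simple model in which the diameter of a direct network grows roughly as the logarithm of the number of chips to the base of the radix; the topology model and the numbers below are assumptions for illustration only.

    import math

    def approx_diameter(num_chips, radix):
        # Rough model: diameter ~ ceil(log_radix(num_chips)) for many direct networks
        return math.ceil(math.log(num_chips) / math.log(radix))

    for radix in (8, 16, 32, 64):
        print(radix, approx_diameter(10_000, radix))
    # radix 8 -> about 5 hops, 16 -> 4, 32 -> 3, 64 -> 3; more ports (and pins) mean fewer hops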

In view of the above, a photonics-optimized processor (POP) system in accordance with the present disclosure is designed to use integrated silicon photonics (ISP) instead of copper for communication among chips. ISP can communicate information to and from a processor chip using dramatically less power and space than the copper-based communication technology which is in use presently. Alternatively, the improvements in power and space can be used to dramatically increase the bandwidth. By using ISP, a computer architect can design a processor with a new microarchitecture which has dramatically higher performance and requires less energy and lower cost to achieve a given level of performance for data-intensive applications.

Processor with ISP

The system and microarchitecture design spaces change dramatically when the cost of communication among chips drops by ten times along each of a number of dimensions. An ISP-based POP chip in accordance with the present disclosure may be in the same cost range as today's copper-based server chip, and can have ten to twenty times more bandwidth to memory and perform data-intensive calculations at 90% efficiency instead of 5%. Embodiments of a POP chip of the present disclosure can also have an IPI with the “glueless” scalability (not requiring external chips for scaling), bandwidth, and low latency required to build a datacenter, mainframe, or supercomputer with thousands of processor chips. All of the active components of the IPI may be on the POP chip, so only passive waveguides and/or fibers are needed to build a high performance computer system. Advantageously, ISP may replace all of the signal pins, leaving just the power and ground pins. As a printed circuit board (PCB) in a system according to the present disclosure is merely used to deliver power, the PCB becomes much simpler and cheaper compared to conventional PCBs. The size of a supercomputer in which embodiments of the present disclosure are implemented may be reduced fifty times, and so latency due to the speed of light may be reduced by eight times. This reduction is important for support of memory models such as shared global memory. For example, a shared memory model may require that the total latency budget for global memory accesses is less than 100 ns. The roundtrip propagation delay in a copper-based supercomputer can be greater than 300 ns, which would significantly reduce performance. In accordance with this disclosure, over 10,000 processor chips may be packaged into a total volume of no more than 8 cubic meters (m³). The maximum total path length between processor chips may be 3.6 m and the maximum roundtrip propagation delay may be 39 ns. The remaining 61 ns of the latency budget may be allocated to the logic in the path. A density of the processor chips may be greater than 1,000 chips per cubic meter.
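
A short check of the propagation-delay and density figures quoted above, assuming light travels in the optical path at roughly c/1.5 (the effective index is an assumption):

    # Round-trip speed-of-light delay over the maximum inter-chip path
    c_vacuum_m_s = 3.0e8
    fiber_index  = 1.5          # assumed effective index of the optical path
    path_m       = 3.6          # maximum total path length between processor chips

    round_trip_ns = 2 * path_m * fiber_index / c_vacuum_m_s * 1e9
    print(round_trip_ns)        # ~36 ns, consistent with the ~39 ns budgeted above

    print(100 - 39)             # 61 ns of the 100 ns budget left for switches and other logic
    print(10_000 / 8)           # 1,250 chips per cubic meter, greater than 1,000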

The ISP-based POP server chip in accordance with the present disclosure may also support a number of memory models which simplify software development for datacenters, mainframes, and supercomputers.

However, to achieve these capabilities, the architectures need to be optimized to a new set of constraints. Simply replacing the copper I/O on a legacy chip may provide modest improvements at best.

Photonics-Optimized Processor

The architecture of a photonics-optimized processor, or POP, and a POP-based system in accordance with the present disclosure can deliver capabilities which are not possible with copper. The optimization starts at the system level.

The system design may optimize the POP and the components outside of the POP. This includes, but is not limited to, the IPI, the multi-level memory system from rotating disk to solid state disk (SSD) to high-speed memory, e.g., DRAM, using a type of DRAM which provides higher bandwidth and/or lower latency, packaging, electrical power delivery, heat removal, and connectivity to components such as networks, sensors, displays, and actuators. Many of the characteristics of these system level components affect the parameters for optimizing the POP chip, so it is important to optimize them together to achieve an optimal design.

Optimizing the POP includes, but is not limited to, adding circuits which directly connect to photonic components, reducing wafer fabrication costs by eliminating CMOS process steps for building high voltage I/O transistors, increasing the number of cores, increasing the bandwidth of the OCI, increasing the memory bandwidth, decreasing latency of references to local memory and to global memory, increasing the number of address bits (in buses, registers, CPU, MMU), increasing the maximum page size, implementing a cache-coherence system which is scalable to thousands or millions of processor chips, increasing the radix and injection bandwidth of the on-chip IPI switch, and reducing the number of buffers in the IPI switch because the lengths of the IPI physical media have been reduced.

A POP can be a general-purpose processor, a special-purpose processor, or a hybrid (heterogeneous processor). A core in a multicore POP may be a general-purpose processor or a special-purpose processor. A multicore POP comprising two or more types of cores is a heterogeneous POP. A general-purpose processor generally has an instruction set architecture (ISA) and microarchitecture designed for balanced performance across a wide range of applications such as, for example, ARM, x86, Power, scalable processor architecture (SPARC) and Microprocessor without Interlocked Pipeline Stages (MIPS). A special-purpose processor generally has an ISA and microarchitecture designed for high performance in certain applications, but low performance in others. Applications suitable for special-purpose processors may include, for example, digital signal processing (DSP), network processors and graphics processing units (GPU). Specialized processors may also be found in interfaces to sensors or actuators, for example. Very simple types of special-purpose processors may include DMA engines for transferring large blocks of data.

A POP which uses wavelength-division multiplexing (WDM) can save substantial energy and chip area by replacing a serializer/deserializer (SerDes), with its equalization and clock data recovery (CDR) circuits, with a parallel-to-serial converter and a forwarded clock. The low attenuation of the photonic path dramatically reduces the jitter. Skew is all but eliminated by sending the data and clock on different channels (wavelengths) in the same waveguide/fiber. The reduced jitter and/or skew allows this efficient clock forwarding technique to be used at much higher data rates in ISP links than in the usual copper circuits.

Example Implementations

Select example implementations in accordance with the present disclosure are described below. It is noteworthy that some or all features described below may be embodied in one single multi-processor system. That is, features of multiple example implementations described below may be embodied in the same multi-processor system. In other words, a multi-processor system in accordance with the present disclosure may include some or all of the features described below.

FIG. 1 illustrates a photonics-optimized multi-processor system 100 in accordance with the present disclosure. Referring to FIG. 1, a multi-processor system 100 may include a plurality of tiles of photonics-optimized processors. Each of the photonics-optimized processor tiles (or, for short, photonics-optimized processors) 120(1)-120(N) may include a respective one of a plurality of processor chips 121(1)-121(N). Each of the processors 120(1)-120(N) may also include a respective one of a plurality of photonic wafers or portions of photonic wafers 123(1)-123(N). Each of the processor chips 121(1)-121(N) may include one or more I/O components, such as I/O components 122(1)-122(P) shown in FIG. 1. Each of the I/O components 122(1)-122(P) may be configured to transmit or receive information, data and/or signals. Each of the photonic wafers 123(1)-123(N) may include a plurality of photonic components, such as photonic components 113(1)-113(O) shown in FIG. 1. For each of the processors 121(1)-121(N), each of the I/O components 122(1)-122(P) may be connected with one of the photonic components of the processor. The connection may be minimized by placing the respective photonic wafer in close proximity to the respective processor chip. For example, in the example shown in FIG. 1, photonic wafer 123(1) may be disposed directly above processor chip 121(1), and a short connection, such as 30 μm of vertical copper pillars, may be used to connect I/O component 122(1) and photonic component 113(1) so as to minimize the physical distance of the connection therebetween. Similarly, each of the I/O components of the respective processor chip may be connected to circuits of the processor chip using short connections so as to minimize the distance between the I/O component and the circuits, while the circuits serve as a source or sink of the information, data and/or signals transmitted or received. For example, in the example shown in FIG. 1, the physical distance of the connection between I/O component 122(1) and data 124(1) may be minimized by using a short connection.

Processors 120(1)-120(N) may be interconnected with one another via an inter-processor interconnect (IPI) 110 such that the processors 120(1)-120(N) may communicate with one another. IPI 110 may include a waveguide assembly, and the waveguide assembly may include a plurality of optical fibers and/or other types of waveguide. IPI 110 may also include one or more fibers configured for WDM, as well as couplers each of which is disposed on an optical path of the one or more fibers. The couplers are used to connect the fibers to the waveguide assembly. The fibers, configured for WDM, may be single-mode fibers, and the couplers may be either grating couplers or edge couplers. Each of the processors 120(1)-120(N) may accordingly connect to IPI 110 with one or more of its respective photonic components connected to the fibers configured for WDM. In the example shown in FIG. 1, IPI 110 includes a waveguide assembly 111, fibers 140(1)-140(S) configured for WDM, and a plurality of couplers 112(1)-112(M) each disposed on an optical path of the fibers 140(1)-140(S). Photonic components 113(1)-113(O) are connected to IPI 110 through the fibers 140(1)-140(S), establishing the interconnection between the processors 120(1)-120(N). Communications among the processors 120(1)-120(N) can thus be realized. For example, a communication path between processors 120(1) and 120(N) may be formed by a series of connections, starting from photonic component 113(1) to coupler 112(1) through fiber 140(1), then from coupler 112(1) to coupler 112(M) through waveguide assembly 111, and then from coupler 112(M) to photonic component 113(O) through the fiber 140(S). The I/O components 122(1)-122(P) may be directly connected to photonic components 113(1)-113(O) to communicate with, e.g., transmit information, data and/or signals to and receive information, data and/or signals from, couplers 112(1)-112(M) of IPI 110. Some or all of the I/O components 122(1)-122(P) may be configured or connected to transmit data. In the example shown in FIG. 1, I/O components 122(1) of processor chip 121(1) and 122(4) of processor chip 121(N) are configured to transmit data 124(1) and 124(Q), respectively, of a plurality of data 124(1)-124(Q). Some or all of the I/O components 122(1)-122(P) may be connected to a forwarded clock device. In the example shown in FIG. 1, I/O components 122(2) of processor chip 121(1) and 122(5) of processor chip 121(N) are connected to forwarded clock devices 114(1) and 114(R), respectively, of a plurality of forwarded clock devices 114(1)-114(R). The forwarded clock devices 114(1)-114(R) may be configured to generate and provide forwarded clock signals 115(1)-115(R).

Each of the processor chips 121(1)-121(N) may also include a respective one of a plurality of internal clocks 125(1)-125(N), each of which is connected to a respective I/O component. In the example shown in FIG. 1, internal clock 125(1) is connected to I/O component 122(3), and internal clock 125(N) is connected to I/O component 122(P). Each of the internal clocks 125(1)-125(N) may serve as a master clock for the respective processor chip. For example, internal clock 125(1) may serve as a master clock for processor chip 121(1). Also, each of the internal clocks 125(1)-125(N) may be controlled by phase and/or frequency information, received by the respective I/O component, from one or more external sources. In the example shown in FIG. 1, internal clock 125(1) is controlled by phase and/or frequency information 150(1) received by I/O component 122(3), and internal clock 125(N) is controlled by phase and/or frequency information 150(2) received by I/O component 122(P). In addition, for some or all of the processor chips 121(1)-121(N), the respective internal clock may be connected to the respective forwarded clock device. In the example shown in FIG. 1, internal clocks 125(1) and 125(N) are connected to forwarded clock devices 114(1) and 114(R), respectively.

Multi-processor system 100 may operate isochronously. The isochronous operation may be achieved by means of a distributed phase locked loop 130. One or more of the internal clocks 125(1)-125(N) may participate in the distributed phase locked loop 130 among two or more of the processor chips 121(1)-121(N) of multi-processor system 100. Each of the internal clocks that participate in the distributed phase locked loop may export a clock signal, through the IPI 110, to one or more others of the participating internal clocks, using the forwarded clock device connected to the internal clock. In the example shown in FIG. 1, a participating internal clock 125(1) of processor tile 120(1) may export the clock signal (represented by phase and/or frequency information 150(3)) to IPI 110 through forwarded clock device 114(1), I/O component 122(2) and photonic component 113(2). On the other end of the IPI 110, another participating internal clock 125(N) of processor tile 120(N) may receive the clock signal exported by internal clock 125(1) (represented by phase and/or frequency information 150(2)) through photonic component 113(O) and I/O component 122(P). Each of the internal clocks 125(1)-125(N) that participate in the distributed phase locked loop may receive phase and/or frequency information exported by one or more others of the participating internal clocks. Each of the participating internal clocks adjusts its own clock signal accordingly and simultaneously for a same period of time while the exporting and receiving of the clock signals is taking place, until the clock signals of all participating internal clocks 125(1)-125(N) converge to a same frequency, thereby achieving the isochronous operation. It is noted that, for each of the participating internal clocks, exporting the clock signal by using the connected forwarded clock device that is otherwise used for normal data transmission has the advantage of realizing the distributed phase locked loop without the need for a dedicated I/O component, simplifying the system.
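
The convergence behavior described above can be illustrated with a toy model in which each participating internal clock repeatedly nudges its frequency toward the frequencies it receives from the other participants; this is a simplified sketch of a distributed phase locked loop, not the control loop of an actual implementation.

    # Toy model: each clock moves part-way toward the average of the frequencies
    # exported by the other participating clocks (fully connected, for simplicity).
    freqs = [1.000e9, 1.002e9, 0.999e9, 1.001e9]   # assumed initial frequencies in Hz
    gain  = 0.5                                     # assumed loop gain per exchange

    for _ in range(20):                             # 20 rounds of exporting/receiving
        avg = sum(freqs) / len(freqs)
        freqs = [f + gain * (avg - f) for f in freqs]

    print(freqs)    # all clocks converge toward the same frequency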

FIG. 2 illustrates a photonics-optimized multi-processor system 200 in accordance with the present disclosure. Referring to FIG. 2, multi-processor system 200 may include a plurality of photonics-optimized processor chips 221(1)-221(N). Each of the processor chips 221(1)-221(N) may include one or more I/O components, such as I/O components 222(1)-222(P) shown in FIG. 2. Each of the I/O components 222(1)-222(P) may be configured to transmit or receive information, data and/or signals. Multi-processor system 200 may also include a waveguide assembly 211 which is photonically connected to the I/O components 222(1)-222(P) of the processor chips 221(1)-221(N).

Multi-processor system 200 may further include a central clock device 240 that generates a central clock signal 241. At least one of the I/O components 222(1)-222(P) may be connected to directly receive the central clock signal 241 which is distributed photonically (via photonic components of multi-processor system 200) to one or more of processor chips 221(1)-221(N). The central clock signal 241 may be configured to allow at least one of the processor chips 221(1)-221(N) to clock isochronously with one or more others of the processor chips 221(1)-221(N).

Multi-processor system 100 may achieve isochronous operation by means of a hybrid of centralized and distributed techniques.

FIG. 3 illustrates a photonics-optimized multi-processor system 300 in accordance with the present disclosure. Referring to FIG. 3, multi-processor system 300 may include an electrical-to-optical I/O device 310. Electrical-to-optical I/O device 310 may include an electrical I/O device 312 and a photonic component 311. Electrical I/O device 312 may be configured to receive information, data and/or signals electrically, and transmit the same information, data and/or signals photonically through photonic component 311. Alternatively or additionally, multi-processor system 300 may include an optical-to-electrical I/O device 320. Optical-to-electrical I/O device 320 may include an electrical I/O device 322 and a photonic component 321. Electrical I/O device 322 may be configured to photonically receive information, data and/or signals through photonic component 321 and output the same information, data and/or signals electrically. Electrical I/O device 312 and photonic component 311, as well as electrical I/O device 322 and photonic component 321, may be implemented as the I/O devices and photonic components in multi-processor system 100 of FIG. 1 and multi-processor system 200 of FIG. 2.

FIG. 4 illustrates a photonics-optimized multi-processor system 400 in accordance with the present disclosure. Referring to FIG. 4, multi-processor system 400 may include a plurality of photonics-optimized processor chips 421(1)-421(N). Each of the processor chips 421(1)-421(N) may include one or more I/O components, such as I/O components 422(1)-422(P) shown in FIG. 4.

Multi-processor system 400 may include a plurality of photonic components 413(1)-413(O). Multi-processor system 400 may also include a communication interconnect or inter-processor interconnect (IPI) 411, which is photonically connected to the I/O components 422(1)-422(P) of the processor chips 421(1)-421(N) through the photonic components 413(1)-413(O). That is, each of the I/O components 422(1)-422(P) may be configured to transmit or receive information, data and/or signals through a respective one of the photonic components 413(1)-413(O).

Multi-processor system 400 may also include an I/O device 414. At least one of the I/O components 422(1)-422(P) of at least one of the processor chips 421(1)-421(N) may be directly connected to one or more of the photonic components 413(1)-413(O) to connect to the I/O device 414. I/O device 414 may include a peripheral component interconnect express (PCIe), universal serial bus (USB), Ethernet, storage device, sensor or actuator.

Each or at least one of processor chips 421(1)-421(N) may include a cache (or on-chip memory), a switch (or on-chip switch), a memory management unit (MMU), a latency-hiding mechanism, a voltage regulation circuit, a coherence unit and/or a memory controller. That is, processor chips 421(1)-421(N) may include caches 423(1)-423(N), switches 424(1)-424(N), MMUs 425(1)-425(N), latency-hiding mechanisms 426(1)-426(N), voltage regulation circuits 427(1)-427(N), coherence units 428(1)-428(N) and/or memory controllers 429(1)-429(N). In the example shown in FIG. 4, processor chip 421(1) includes cache 423(1), switch 424(1), MMU 425(1), latency-hiding mechanism 426(1), voltage regulation circuit 427(1), coherence unit 428(1) and memory controller 429(1). Likewise, in the example shown in FIG. 4, processor chip 421(N) includes cache 423(N), switch 424(N), MMU 425(N), latency-hiding mechanism 426(N), voltage regulation circuit 427(N), coherence unit 428(N) and memory controller 429(N).

Multi-processor system 400 may also include one or more external devices 430. In some implementations, as shown in FIG. 4, the one or more external devices 430 may include a plurality of memory devices 412(1)-412(M) that are external to the processor chips 421(1)-421(N). Each of the memory devices 412(1)-412(M) may be connected to IPI 411 and associated with a respective one of the processor chips 421(1)-421(N) and configured to support a plurality of memory models. At least one of the I/O components 422(1)-422(P) may be directly connected to one or more of the photonic components 413(1)-413(O) to provide address information 440 to at least one of the memory devices 412(1)-412(M). Each or at least one of the memory devices 412(1)-412(M) may be cache-coherent with the cache of the associated one of the processor chips 421(1)-421(N).

Multi-processor system 400 may also include a directory subsystem 415 to keep precise information about the location of shared memory blocks. When the state of a shared memory block changes, coherence messages may be sent to all of the sharers (one or more of the memory devices 412(1)-412(M) that contain a copy of that block) and none of the non-sharers to minimize coherence traffic. The coherence units 428(1)-428(N) may use information provided by the directory subsystem 415 to implement a scalable coherence protocol in which the coherence traffic is O(N), where N is the number of cores or processor chips. The coherence traffic of this directory-based protocol is greatly reduced as compared to that of snoopy protocols, whose coherence traffic is O(N²). The O(N²) coherence traffic of snoopy protocols results from the fact that the total number of coherence messages is proportional to the number of cores or processor chips and all coherence messages are sent to all cores or processor chips. The coherence protocol implemented by the multi-processor system 400 is therefore scalable with the total number of processor chips in the system, an advantage that snoopy protocols do not provide.

In order to maintain precise tracking of sharers and thus low coherence traffic, directory subsystem 415 may store a directory entry for every memory block. The directory entry may contain the state of the respective memory block and a list of every sharer of the block. The photonics-optimized processors (POPs) of the present disclosure are optimized for systems with large numbers of processor chips, so scalability is important. A simple but non-scalable implementation is to provide storage for every memory block for all potential sharers. For example, a directory may have a table in which there is a row for every memory block and the row contains a directory entry which contains the state of the block and a bit vector in which there is a bit for every potential sharer. For this non-scalable implementation, the size of the bit vector must be increased as the number of processor chips grows. An alternative, scalable method employed by the present disclosure is to replace the bit vector with i pointers to potential sharers. This method is scalable since i does not depend on the number of processor chips. However, the optimal value of i depends on the degree to which the software causes the memory block to be shared. If the block is shared among more than i caches, extra invalidations occur, which reduces performance. The maximum number of sharers has a practical limit because the performance and scalability of parallel software typically depend on avoiding sharing memory blocks among a large number of processors. For example, if the software creates a “hot spot” through excessive sharing of a memory block, the software performance may be limited by congestion in the IPI, queuing at the memory controller, or other factors, rather than by a shortage of pointer storage in the directory. Proper hardware-software co-design can ensure that there is a reasonable bound on the number of sharers per block and that there is feedback to the software to improve performance by staying within the bound. The hardware architectural goal is to minimize extra invalidations while minimizing the total storage allocated to pointers.
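
A minimal sketch of the limited-pointer directory entry described above, assuming i pointers per block and a simple "invalidate all current sharers on overflow" policy; both the value of i and the overflow policy are assumptions, since the disclosure leaves such choices to hardware-software co-design.

    MAX_SHARERS = 4     # the value of i; chosen only for illustration

    class DirectoryEntry:
        def __init__(self):
            self.state = "Uncached"
            self.sharers = []       # a short list of pointers (chip IDs), not a bit vector

        def add_sharer(self, chip_id, invalidate):
            if len(self.sharers) >= MAX_SHARERS:
                # Overflow beyond i sharers: extra invalidations occur, reducing performance
                for s in self.sharers:
                    invalidate(s)
                self.sharers = []
            self.sharers.append(chip_id)
            self.state = "Shared"

        def write(self, chip_id, invalidate):
            # Coherence messages go only to the current sharers, never to every chip,
            # so the traffic scales with the number of sharers rather than with N squared.
            for s in self.sharers:
                if s != chip_id:
                    invalidate(s)
            self.sharers = [chip_id]
            self.state = "Modified"

    entry = DirectoryEntry()
    entry.add_sharer(3, invalidate=lambda chip: print("invalidate chip", chip))
    entry.write(7, invalidate=lambda chip: print("invalidate chip", chip))   # invalidates chip 3 only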

The average number of sharers per memory block is independent of the number of processors and can be estimated as the ratio of the size of the caches on a processor chip to the size of its local off-chip memory. For example, a processor chip may have 32 megabytes (MB) of cache and 64 gigabytes (GB) of local off-chip memory. Only one out of 2,000, or 0.05%, of the memory blocks will fit into the cache at the same time. Typical architectures allocate at least one pointer for every memory block, so 1,999 out of every 2,000 pointers would be unused. Furthermore, if there are multiple sharers for a memory block, there is no mechanism for storing the pointers for these additional sharers in the unused directory entries for other memory blocks.
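
The 0.05% figure follows directly from the example sizes in the preceding paragraph:

    cache_bytes        = 32 * 2**20       # 32 MB of on-chip cache
    local_memory_bytes = 64 * 2**30       # 64 GB of local off-chip memory

    fraction_cached = cache_bytes / local_memory_bytes
    print(fraction_cached)   # 1/2048, i.e. roughly one block in 2,000, or about 0.05%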

The directory subsystem 415 may be configured to support varying numbers of sharers per memory block. The directory subsystem 415 may be configured with a hashing function or other means to compress the directory table to minimize storage for pointers allocated to memory blocks which have no sharers. The number of pointers allocated to memory blocks which do have sharers may be varied by changing the size of the list element in the directory table or by chaining additional storage for pointers to the list. The performance of directory subsystem 415 may be improved by caching some of the directory information.

The present invention may use the properties of ISP and/or three-dimensional (3D) dynamic random access memories (DRAMs) to minimize the cost of directory storage and optimize performance. ISP may provide sufficient bandwidth for a processor chip to simultaneously access two or more 3D DRAMs. The directory information may thus be stored in a 3D DRAM separately from the memory block data and retrieved in parallel with accessing the memory block data without increasing the access time to the memory block data.

A 3D DRAM stack may include one or more logic chips which may be used to implement a directory subsystem 415. Such a directory subsystem has extremely high bandwidth and low latency access to directory information stored in the 3D DRAM stack.

ISP may provide sufficient bandwidth so that the 3D DRAM may have two or more types of interface. The two or more types of interface may be implemented simultaneously on one or all of its logic chips. One type of interface may be a very simple and fast interface to the associated respective processor chip and/or to another 3D DRAM, e.g., one that contains directory information. Another type of interface may be compatible with the IPI so that other processor chips may access the 3D DRAM directly without consuming their own bandwidth. A switch with two or more ports to the IPI may be employed to provide redundancy and increase bandwidth to the IPI.

FIG. 5 illustrates a memory hierarchy 500 for a photonics-optimized processor core in accordance with the present disclosure. A photonics-optimized processor chip, such as any of the processor chips 121(1)-121(N) of FIG. 1, 221(1)-221(N) of FIG. 2 and 421(1)-421(N) of FIG. 4, may include a plurality of processor cores. Referring to FIG. 5, each processor core may have a memory hierarchy that may include a respective CPU 550, a respective MMU 510 and a respective level 1 (L1) cache 520. Other components of the memory hierarchy may be shared among a plurality of cores. The shared components of the memory hierarchy may include a plurality of level 2 (L2) caches 530, a plurality of coherence units 560, a plurality of memory controllers 570, a plurality of main memory devices 580 and a plurality of IPI interfaces 590. A photonics-optimized multi-processor system may include many times more processor chips and memory devices than a typical legacy system, and thus require a larger address space to be addressed efficiently. For example, to contain 10,000 times more memory devices, the additional address space may require the size of the virtual address (VA) and physical address (PA) to be increased by 14 bits. Furthermore, the sizes of various components of the memory hierarchy must be increased accordingly. A preferred embodiment, as shown in FIG. 5, may have a 64-bit VA 511 and a 64-bit PA 518. The size of the virtual page number 512 and the page offset 513 may need to be increased so that the combined size of the two is 64 bits, instead of the 48-bit combined size used in typical legacy processors. The size of the virtual page number 512 may be increased to allow a larger number of virtual pages in the VA space. The maximum size of the page offset 513 may be increased to 60 or 64 bits to allow very large page sizes. The memory hierarchy may also include a translation lookaside buffer (TLB). The sizes of the TLB tag compare address 514 and TLB tag 516 may be increased to accommodate the increased number of virtual pages. The TLB data 517, the L1 tag compare address 521, the L1 cache tag 524, the L2 tag compare address and the L2 cache tag may be increased to accommodate the larger PA. Similar changes may be made to the coherence unit 560 and the IPI interface 590 to accommodate the increased size of the PA. Other components of the memory hierarchy may not be affected by the changes in the VA and PA. The size of the TLB index 515 is determined by the number of entries in the TLB 510. The TLB comprises the TLB tags 516 and the TLB data 517. The size of the L1 cache index 522 is determined by the number of blocks in the L1 cache 520. The size of the L1 block offset is determined by the number of bytes in an L1 cache block. The L1 cache comprises the L1 cache tags 524 and the L1 data 525. The size of the L1 data 525 is determined by the size of an L1 cache block. The size of the L2 cache index 532 is determined by the number of blocks in the L2 cache 530. The L2 block offset 533 is determined by the number of bytes in an L2 cache block. The address size for the memory controller 570 is determined by the size of the main memory 580.
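
The relationships between the field sizes called out above can be made concrete with a short calculation; the page, block, cache, and TLB sizes below are assumptions chosen only to show how the widths are derived for a 64-bit PA, and a direct-mapped organization is assumed for simplicity.

    import math

    PA_BITS        = 64
    page_bytes     = 2**16     # assumed 64 KB page
    l1_block_bytes = 64        # assumed L1 cache block size in bytes
    l1_blocks      = 512       # assumed number of blocks in a direct-mapped L1 cache
    tlb_entries    = 256       # assumed number of TLB entries

    page_offset_bits = int(math.log2(page_bytes))                  # 16
    l1_offset_bits   = int(math.log2(l1_block_bytes))              # 6
    l1_index_bits    = int(math.log2(l1_blocks))                   # 9
    l1_tag_bits      = PA_BITS - l1_index_bits - l1_offset_bits    # 49
    tlb_index_bits   = int(math.log2(tlb_entries))                 # 8

    print(page_offset_bits, l1_index_bits, l1_offset_bits, l1_tag_bits, tlb_index_bits)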

A photonics-optimized processor may be packaged and assembled so that a supercomputer comprising 32,000 or more processor chips and 2 petabytes of memory may fit in a very small volume. Advanced techniques in photonic interconnect, packaging, power delivery, and cooling technology may be required to achieve such a high density of processing capability. Each processor chip may be assembled using 2.5-dimensional (2.5D) integration techniques as part of a processor-memory tile. FIGS. 6A, 6B and 6C illustrate a processor-memory tile 600 in accordance with the present disclosure. Referring to FIG. 6A, a photonics layer 610 may be used as a substrate for the tile. The processor chip 620 and the 3D DRAM memory stacks 630 may be flip-chip bonded to contacts on the photonics layer 610. The grating couplers 640 and the waveguides 650 are configured to carry light in fibers, thereby forming communication links to other processor chips (not shown) and to photonic components (not shown) under the processor chip 620. The waveguides 650 may also form communication links between the processor chip 620 and the 3D DRAM memory stacks 630 on the same photonic layer 610. Referring to the cross-section view of FIG. 6B, the photonics layer 610 may comprise a portion of a Silicon-On-Insulator (SOI) wafer. The top silicon layer 612 may be 220 nm thick and used to form waveguides therein. The buried oxide (BOX) layer 611 may be 2 microns thick and used to form part of a cladding around the waveguides to achieve total internal reflection. The layer of silicon which forms the base wafer 613 may be 775 microns thick and used to provide a desired mechanical strength for the processor-memory tile 600. Copper pillars 675 are used to connect various devices on the processor chip 620 to the photonics layer 610. For example, a copper pillar 675 connects the transmitter (Tx) circuit 622 on the processor chip 620 to the modulator 616 on the silicon layer 612. Similarly, a copper pillar 675 connects the receiver (Rx) circuit 623 on the processor chip 620 to the photodiode 617 on the silicon layer 612. Also, a copper pillar 675, together with a Through Silicon Via (TSV) 670 in the photonic layer 610, a Flip Chip (FC) ball 676, a Ball Grid Array (BGA) substrate via 673, a BGA ball 677 and a PCB via 674, provides vertical power and ground connections for the processor chip 620 by connecting to the power and ground planes 678 of the PCB 601. The PCB 601 is included in FIG. 6B for illustrative purposes, but is not part of the tile 600. The BGA substrate 605 may be 1.1 mm thick, while the PCB 601 may be 1.6 mm thick. Namely, the distance from a transistor on the processor chip 620 to a bottom ground plane of the PCB 601 may be, for example, as short as 3.78 mm. These vertical connections are much shorter than connections that would otherwise have been routed on a horizontal metal layer in a conventional method. The total thickness of the stack, from the bottom of the PCB 601 to the back side of the processor chip 620, may be 5 mm. Optical fiber ribbons 644 are coupled with grating couplers 640 at an angle that is close to 90 degrees. The optical fiber ribbons 644 may have a minimum bend radius of 12.7 mm, so the horizontal segment 645 may be 17 mm above the bottom of the PCB 601. Thus a processor-memory tile 600 may be packaged in a packaging box with an inside vertical dimension of 17 mm.
Alternatively, the optical fiber ribbons 644 may be coupled with the grating couplers 640 through optical systems that include 45 degree angle mirrors, and thus coupled with the grating couplers 640 at an angle that is close to 0 degrees. In this case, the optical fiber ribbons 644 would extend no more than 6 mm from the bottom of the PCB 601 in the vertical direction, and the inside vertical dimension of the packaging box may thus be reduced. TSVs 670 also provide power and ground for the memory stacks. TSVs 671 and substrate vias 673 provide connections from voltage converter circuits 621 on the processor chip 620 to inductors 672 embedded in the BGA substrate 605 below. The voltage converter circuits 621 may be used to integrate multiple power supplies on the processor chip 620. Photonic connections are used for all signals, so the number of copper connections is greatly reduced, which improves reliability and reduces cost.

Referring to FIG. 6C, photonic components may be formed in the photonic layer 610 using standard CMOS processing techniques. One or more grating couplers 640, waveguides 650, modulators 660 and photodetectors 625 may be formed therein. One or more grating couplers 641, waveguides 651, and splitters 652 may also be formed therein to deliver continuous wave (CW) laser power to the modulators 660. An outline of the processor chip 620 is illustrated in FIG. 6C to show that the modulators 660 and the photodetectors 625 may be underneath the processor chip 620. The Tx circuit 622 and the Rx circuit 623 of the processor chip may be placed such that they are directly above the modulators 660 and photodetectors 625. This placement is advantageous because it reduces the length of electrical connection paths from a conventional 150 mm to less than 50 microns.

The use of integrated photonics and the relevant packaging techniques described above results in a significant improvement in the electrical performance of the photonics-optimized processor system according to the present disclosure. Meanwhile, a very large volume of packaging materials, including PCB area, sockets, packages, and connectors, is removed.

A Compute Module (CM) may package together a plurality of processor-memory tiles, means to provide electrical power and cooling, and connections to ribbon fiber cables. Referring to FIG. 7, a CM 700 may include a box 710 enclosing one or more processor-memory tiles 720. The CM 700 may also include a motherboard PCB 730, one or more Voltage Regulator Modules (VRM) 740, and one or more heat exchangers 750. One or more single-mode ribbon fiber connectors 771 may be mounted on a panel of the box 710. A fiber ribbon 772 may connect a grating coupler 773 in a tile 720 to an inner end of the connector 771. A fiber ribbon cable 770 may connect to an outer end of the connector 771. The box 710 may be filled with a dielectric coolant 760 which is in contact with the back side of the processor chips 721 and other components in the tiles 720. A heat exchanger 750 may contain circulating water or other coolant fluid at a temperature below the boiling point of the dielectric coolant 760, and serve to condense the coolant 760 from vapor phase to liquid phase. Each of the processor chips 721 may have a layer of sintered copper Boiling Enhancement Coating (BEC) 722 on the back side, with the BEC 722 serving as a heat spreader. Aided by the BEC 722, the dielectric coolant 760 may be able to remove a heat flux of 35 W/cm² from each of the processor chips 721 when it is heated and changes to vapor phase. A processor chip 721 may not require such a heat spreader if it generates a lower heat flux. This phase-changing liquid cooling system may be much more efficient and require substantially less space than an air cooling system or a single-phase liquid cooling system. The PCB 730 may connect the VRM 740 to a power source outside of the CM 700. The VRM 740 may convert a higher voltage to a lower voltage so that electrical power can be delivered to the CM using smaller conductors. The PCB 730 may connect a VRM 740 to the tile 720 using large conductors for the high current that is needed to deliver power at a low voltage. The VRM 740 may be placed close to the tile 720 to minimize the length of the large conductors. A CM 700 containing 16 processor-memory tiles 720 may fit into a box 710 with outer dimensions of 275 mm in length, 370 mm in width and 20 mm in height.

A plurality of Compute Modules (CMs) may be assembled to form a supercomputer Compute Unit (CU). Referring to a top view of a CU as illustrated in FIG. 8A, a CU 800 may include as many as 10 racks 810 and a Fiber Vault 830. A rack 810 may be 390 mm wide, 2500 mm high and 580 mm deep. A Fiber Vault 830 may contain 2 million single-mode fibers and may be 1950 mm wide, 2500 mm high and 500 mm deep. Referring to a front view of the CU 800 as illustrated in FIG. 8B, a rack 810 may contain a plurality of CMs 820. A rack 810 may have 57 drawers 840, each of which is 44.5 mm high. This height is also known as 1 Rack Unit (RU). A drawer 840 may be 580 mm deep, enough to place two CMs 820, one in the front (as shown in FIG. 8B) and one in the back (not shown). Since the height of a CM 820 is 20 mm, a second layer of two CMs 820 may be placed on top of a first layer of two CMs 820. Thus a drawer 840 may contain 4 CMs 820, and a rack may therefore contain 228 CMs 820. A drawer 840 may contain ribbon fiber connectors mounted on a back panel, and the ribbon fiber connectors may subsequently blind-mate to connectors on a Fiber Vault 830. In this way, photonic connections may be made among all of the CMs 820 in the CU 800. The CU 800 may contain 2,280 CMs 820 with a total of 36,480 processor-memory tiles, or 1,167,360 cores, in 36,480 photonics-optimized processor chips, as well as 2.3 petabytes of memory, all in a volume of 8 cubic meters. The maximum distance between any pair of processor chips of the CU may be 3.6 m, and the maximum latency due to the speed-of-light propagation delay between any pair of processor chips may be 39 ns. A time budget of 100 ns for a global memory access may be divided into 2 parts, the first part being 39 ns to account for the latency due to the speed-of-light propagation delay, and the second part being 61 ns to account for the propagation delay caused by the IPI switches and other logic along a path from a core to a memory. In contrast, a conventional supercomputer with a similar number of cores and memory size may require a space of 280 cubic meters. A maximum distance between any pair of processor chips may be 33 m and a maximum latency due to the speed-of-light propagation delay between any pair of processor chips may be as much as 360 ns. This almost 10 times higher latency due to the speed-of-light propagation delay makes a time budget of 100 ns infeasible, and thus the performance of a conventional supercomputer is significantly limited. From the comparison, it is evident that a reduction in size leads to a reduction in latency, which gives a significant advantage to the present invention.
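
The capacity and latency figures above follow from the packaging just described; the per-tile core count of 32 is implied by the stated totals (1,167,360 cores in 36,480 tiles) rather than stated directly.

    racks            = 10
    drawers_per_rack = 57
    cms_per_drawer   = 4
    tiles_per_cm     = 16
    cores_per_tile   = 32                                  # implied: 1,167,360 / 36,480

    cms   = racks * drawers_per_rack * cms_per_drawer      # 2,280 Compute Modules
    tiles = cms * tiles_per_cm                             # 36,480 processor-memory tiles
    cores = tiles * cores_per_tile                         # 1,167,360 cores
    print(cms, tiles, cores)

    print(39 + 61)                                         # 100 ns global memory access budget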

A system with two or more processor chips, such as the one shown in FIG. 9, may be used to illustrate the communication among processor cores and off-chip memories. Referring to FIG. 9, a multi-processor system 900 may include a plurality of processor chips 910(1), 910(2) . . . 910(N). Each of the processor chips 910(1), 910(2), . . . , 910(N) may include one or more cores 915, an on-chip interconnect (OCI) 940, OCI wires 980, a memory controller 920, an IPI interface 955, an IPI switch 950, and one or more photonic interfaces 960. The memory controller 920 may connect to one or more off-chip memory devices 930 through memory control wires 925. A core 915 of a processor chip may communicate with an off-chip memory device 930 of the same processor chip. This local communication can be realized through a path that is successively composed of an OCI wire 980, the OCI 940, another OCI wire 980, the memory controller 920 and a memory control wire 925, all of which are of the same processor chip. Alternatively, a core 915 may communicate with an off-chip memory device 930 of a different processor chip. For example, a core 915 of the processor chip 910(1) may communicate with an off-chip memory device 930 of the processor chip 910(2). This remote communication can be realized through a path that is successively composed of an OCI wire 980, the OCI 940 of the processor chip 910(1), another OCI wire 980, the IPI interface 955 of the processor chip 910(1), the IPI switch 950 of the processor chip 910(1), a photonics interface 960 of the processor chip 910(1), a fiber 970 between the processor chip 910(1) and the processor chip 910(2), a photonics interface 960 of the processor chip 910(2), the IPI switch 950 of the processor chip 910(2), the IPI interface 955 of the processor chip 910(2), another OCI wire 980, the OCI 940 of the processor chip 910(2), another OCI wire 980, the memory controller 920 of the processor chip 910(2) and a memory control wire 925.

In order to achieve satisfactory performance for both the local and remote communications stated above, it is essential for the communication paths to have enough bandwidth. Specifically, for each of the processor chips of the multi-processor system, the OCI 940 must provide enough bandwidth between the cores 915 of the processor chip and the IPI interface 955 to avoid communication congestion along the path. In addition, for each of the processor chips, the OCI 940 must provide enough bandwidth between the cores 915 of the processor chip and the memory controller 920 to avoid congestion along the path. Furthermore, for each of the processor chips, the OCI 940 must also provide enough bandwidth between the IPI interface 955 and the memory controller 920 to avoid congestion along the path.

In the case of a low locality application, substantially each of the cores 915 of the multi-processor system will be communicating with other processor chips, i.e., the ones to which the core does not belong, so as to access global memory. Similarly, substantially all of the communications of a memory controller 920 of the multi-processor system will be with cores 915 of other processor chips. Therefore, in this low locality case, for each of the processor chips, the bandwidth requirements for the IPI interface 955 and the IPI switch 950 will be approximately equal to the sum of the bandwidth requirements of the cores 915 and the memory controller 920.

In one example implementation, a multi-processor system may include a plurality of photonic components and a plurality of processor chips each of which includes at least one input/output (I/O) component that is designed to directly drive a first photonic component (e.g., modulator) of the photonic components or receive a signal from a second photonic component (e.g., photodetector) of the photonic components. Each of the I/O components of the processor chips may be implemented substantially in photonics, and/or may not include any high voltage “I/O” transistors. In some embodiments, a metal path connecting at least one of the processor chips to at least one of the photonic components may be constructed so that parasitics, e.g., resistance, inductance, capacitance, of the metal path cause less than 3 dB of signal attenuation. In some embodiments, the capacitance of the metal path may be less than one or several femto-Farads (fF). In some embodiments, a length of the metal path may be less than one or several microns (μm) or millimeters (mm). In some embodiments, an interface circuit that interfaces with the at least one of the photonic components may be constructed using high performance and low voltage “core” transistors of the processor. In some embodiments, the processor chip may be manufactured by a process which eliminates the wafer fabrication CMOS process steps for building high voltage I/O transistors. In some embodiments, at least one of the photonic components may be made monolithically on a separate semiconductor wafer or portion of a wafer (photonic wafer). In some embodiments, at least one of the photonic components may be made monolithically on the same semiconductor wafer or portion of a wafer as the processor. In some embodiments, at least one of the photonic components on the photonic wafer may include active components, such as the final stage of a modulator driver or a pre-amplifier for the photodetector. In some embodiments, at least one of the photonic components may be connected to the processor die using 2.5-dimensional (2.5D) or 3-dimensional (3D) packaging technology. In some embodiments, at least some of the photonic components may be contained in the same package. In some embodiments, the multi-processor system may further include two or more I/O components that are designed as a transceiver pair configured to transmit and receive signals.
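
As a rough illustration of why a parasitic capacitance of a few femtofarads permits an I/O circuit to directly drive a photonic component, a first-order RC estimate can be made; the driver resistance below is an assumption for illustration, not a value from this disclosure.

    import math

    r_ohm = 1_000       # assumed effective driver/source resistance
    c_f   = 5e-15       # assumed 5 fF of parasitic capacitance on the short metal path

    f_3db_ghz = 1 / (2 * math.pi * r_ohm * c_f) / 1e9
    print(f_3db_ghz)    # ~31.8 GHz first-order bandwidth, so the short path adds little attenuation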

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes a set of two or more I/O components which are designed to connect to photonic components for wavelength-division multiplexing (WDM). In some embodiments, the multi-processor system may further include one or more waveguides configured for WDM. In some embodiments, the multi-processor system may further include one or more fibers configured for WDM. In some embodiments, at least one of the one or more fibers may be a single-mode fiber. In some embodiments, the processor may be configured to perform WDM using one or more fibers where an optical path may include a grating coupler. In some embodiments, at least one of the processor chips may be configured to perform WDM using one or more fibers where an optical path may include an edge coupler. In some embodiments, at least one of the processor chips may be configured for WDM using one or more fibers where an optical path may include a micro-mirror.
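
A brief, hypothetical WDM sizing example (channel count and per-wavelength line rate are assumed) shows how per-wavelength I/O components aggregate onto one fiber.

    # Hypothetical WDM link budget: per-wavelength data rate times the number
    # of wavelengths multiplexed onto one single-mode fiber.
    wavelengths = 16            # assumed WDM channel count
    rate_per_lambda_gbps = 25   # assumed per-wavelength line rate

    fiber_bandwidth_gbps = wavelengths * rate_per_lambda_gbps
    print(f"One fiber carries ~{fiber_bandwidth_gbps} Gb/s aggregate")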

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes a set of two or more I/O components which are designed to connect to photonic components. In some embodiments, at least one of the processor chips may include a forwarded clock device configured to generate a forwarded clock signal, and at least one of the processor chips may be configured to use a data recovery scheme based on the forwarded clock signal. In some embodiments, at least one of the processor chips may include a parallel-to-serial and serial-to-parallel scheme based on the forwarded clock signal. In some embodiments, the multi-processor system may further include a waveguide or fiber, and the forwarded clock signal and data may travel in the same waveguide or fiber. In some embodiments, at least one of the processor chips may be configured to transmit the forwarded clock signal and data using light of different wavelengths.
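
The following is a minimal sketch, not the disclosed circuit, of a forwarded-clock parallel-to-serial/serial-to-parallel scheme: the transmitter sends a clock that toggles once per bit alongside the data, and the receiver samples a bit on each clock transition rather than recovering a clock from the data.

    def serialize(word, width=8):
        # LSB-first data bits plus a forwarded clock that toggles each bit period.
        bits = [(word >> i) & 1 for i in range(width)]
        clock = [i % 2 for i in range(width)]
        return bits, clock

    def deserialize(bits, clock):
        # Sample one data bit on every forwarded-clock transition (no clock recovery).
        word, prev, idx = 0, None, 0
        for bit, clk in zip(bits, clock):
            if clk != prev:
                word |= bit << idx
                idx += 1
                prev = clk
        return word

    bits, clk = serialize(0xA5)
    assert deserialize(bits, clk) == 0xA5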

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes at least one I/O component which is designed to directly receive a central clock signal which is distributed photonically (via photonic components of the multi-processor system) to one or more other processor chips. In some embodiments, the central clock signal may be configured to allow at least one of the processor chips to clock isochronously with one or more others of the processor chips. In some embodiments, the central clock signal received by the processor chips may have a central source, e.g., a central clock device of the multi-processor system.

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes an internal clock and at least one I/O component which is designed to directly interface with photonic components of the multi-processor system and receives, photonically, phase and/or frequency information from one or more external sources, e.g., one or more other processors. In some embodiments, the internal clock may be controlled by the phase and/or frequency information from the one or more external sources. In some embodiments, the at least one I/O component may be configured to directly interface with photonics and photonically transmit phase and/or frequency information to at least one of the processor chips. In some embodiments, the internal clock of the at least one of the processor chips may participate in a distributed phase-locked loop among two or more of the processor chips of the multi-processor system. In some embodiments, the internal clock may be isochronous with the internal clock of one or more others of the processor chips of the multi-processor system.

In one example implementation, a multi-processor system may include a plurality of photonic components and a plurality of memory devices. The multi-processor system may also include a plurality of processor chips each of which includes cache and at least one I/O component which is designed to directly connect to the photonic components to transmit and receive data with at least one of the memory devices through one or more of the photonic components. In some embodiments, the processor chip may include more than one processing element, or core, and each of the cores may have a cache or memory device associated with it. In this case, a bandwidth to a memory device associated with one of the processor chips may be greater than 50% of an aggregate bandwidth to the cache of the plurality of the cores on that processor chip. In some embodiments, at least one of the memory devices may include DRAM. In some embodiments, at least one of the memory devices may include non-volatile random access memory (NVRAM), e.g., flash memory.
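
A short, hypothetical check (core count and bandwidths assumed) of the 50% relationship between off-chip memory bandwidth and the aggregate core-to-cache bandwidth stated above:

    # Hypothetical figures, in GB/s, for one processor chip.
    num_cores = 64
    core_cache_bw = 32        # assumed per-core cache bandwidth
    memory_bw = 1200          # assumed bandwidth to directly attached memory

    aggregate_cache_bw = num_cores * core_cache_bw
    print(memory_bw > 0.5 * aggregate_cache_bw)   # True for these assumed figures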

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes a cache and at least one I/O component which is designed to directly connect to photonic components to communicate data with one or more other processors. In some embodiments, each or at least one of the processor chips may also include an on-chip switch. In some embodiments, the on-chip switch may be configured to provide full global bandwidth to an IPI that connects the processor chips. In some embodiments, the on-chip switch may be configured to provide an injection bandwidth greater than 50% of an aggregate bandwidth to memory of a plurality of cores on the processor chip in a multi-processor system. In some embodiments, the on-chip switch may be configured to provide an injection bandwidth greater than 200% of an aggregate bandwidth to memory of a plurality of cores on the processor chip in a multi-processor system. Such a configuration provides good performance in the case where all cores are accessing memory that is directly attached to a different processor chip. In some embodiments, the on-chip switch may be configured with a radix greater than or equal to a number required so that a number of hops required to reach any other processor chip is not greater than 3. In some embodiments, each or at least one of the processor chips may further include a switch that uses a table-driven router. In some embodiments, each or at least one of the processor chips may further include a switch that performs dynamic routing. In some embodiments, each or at least one of the processor chips may further include a switch that uses a buffer pool architecture.
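
As a rough, assumed model of the radix requirement (not a formula taken from the disclosure), a switch of radix r can reach on the order of r^h chips in h hops, so a radix of at least N^(1/3) keeps any-to-any paths within three hops for N chips:

    import math

    # Hypothetical radix estimate for a hop count limited to 3.
    N = 10_000      # assumed number of processor chips
    max_hops = 3

    min_radix = math.ceil(N ** (1 / max_hops))
    print(f"A radix of roughly {min_radix} or more keeps paths within {max_hops} hops")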

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to photonic components to provide address information to an external device, e.g., DRAM or processor. In some embodiments, the multi-processor system may further include a physical address (PA) component configured to address all of the physical memory in the system, including the memory connected to all of the processor chips. In some embodiments, each or at least one of the processor chips may further include a plurality of memory devices and a memory management unit (MMU). The MMU is configured to accept virtual addresses (VAs) up to 64 bits or more and produce physical addresses (PAs) up to 64 bits or more. In some embodiments, each or at least one of the processor chips may further include a plurality of memory devices and a memory management unit (MMU). The MMU is configured to manage large pages such that mappings for all memory devices in the system can be contained in a translation lookaside buffer (TLB), e.g., of less than 1,000 entries, simultaneously.
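
An illustrative calculation (memory capacities assumed, not disclosed) of the large-page size needed so that mappings for all memory devices in the system fit simultaneously in a TLB of fewer than 1,000 entries:

    # Hypothetical large-page sizing for a TLB that covers all global memory.
    global_memory_bytes = 10_000 * 64 * 2**30   # assumed: 10,000 chips x 64 GiB each
    tlb_entries = 1_000

    min_page_bytes = global_memory_bytes / tlb_entries
    print(f"Page size of ~{min_page_bytes / 2**30:.0f} GiB covers all memory in one TLB")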

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to photonic components for communication and that has a latency-hiding mechanism. In some embodiments, the latency-hiding mechanism may include hardware threads, e.g., simultaneous multi-threading (SMT) threads. In some embodiments, the latency-hiding mechanism may include multiple outstanding memory references. In some embodiments, the latency-hiding mechanism may include the use of barrier and cache-coherence protocols.
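
A hypothetical Little's-law estimate (latency, bandwidth, and reference size assumed) of how many outstanding memory references the latency-hiding mechanism must sustain to keep a remote-memory link busy:

    # outstanding references ~= latency x bandwidth / bytes per reference
    latency_s = 200e-9        # assumed round-trip latency to remote memory
    target_bw = 400e9         # assumed sustained bandwidth, bytes per second
    bytes_per_reference = 64  # one cache line

    outstanding = latency_s * target_bw / bytes_per_reference
    print(f"~{outstanding:.0f} outstanding references are needed to hide the latency")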

In one example implementation, a multi-processor system may include a plurality of photonic components and a plurality of memory devices. The multi-processor system may also include a plurality of processor chips each of which includes a cache and at least one I/O component which is designed to directly connect to one or more of the photonic components to provide address information to at least one of the memory devices. The memory devices may be external to the processor chips. Each of the memory devices may be associated with a respective one of the processor chips and configured to support various memory models. In some embodiments, each or at least one of the memory devices may be cache-coherent with the associated one of the processor chips. In some embodiments, each or at least one of the memory devices may not be cache-coherent with the associated one of the processor chips. In some embodiments, the memory models may be user-controllable. In some embodiments, each or at least one of the processor chips may further include an MMU, and a selection of the memory models may be a user-controllable attribute of page mapping in the MMU.

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to photonic components to connect to at least one I/O device. In some embodiments, the I/O device may use a standard interface, such as peripheral component interconnect express (PCIe), universal serial bus (USB), Ethernet, InfiniBand, and the like. In some embodiments, the I/O device may include a storage device. In some embodiments, the I/O device may include a sensor or actuator.

In one example implementation, a multi-processor system may include a plurality of photonic components and an off-chip memory. The off-chip memory may be shared by more than one of the processor chips. The off-chip memory may be directly connected to a single processor chip and shared with other processor chips using a global memory architecture implemented by using a processor-to-processor approach. The multi-processor system may also include a cache and a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to the photonic components to communicate with one or more other processor chips. At least one I/O component of at least one of the processor chips may be configured to use a directory-based cache-coherence protocol. In some embodiments, a cache of at least one of the processor chips may be configured to store directory information. In some embodiments, the off-chip memory may include a DRAM. In some embodiments, directory information may be stored in the off-chip memory and the on-chip cache of at least one of the processor chips. In some embodiments, the multi-processor system may further include a directory subsystem configured to separate the off-chip memory data and the directory information onto two different off-chip memories. In some embodiments, the multi-processor system may further include a directory subsystem configured with some of the subsystem implemented on a high-performance chip which is part of the 3D DRAM memory stack. In some embodiments, the multi-processor system may further include a directory subsystem configured to support varying numbers of sharers per memory block. In some embodiments, the multi-processor system may further include a directory subsystem configured to support varying numbers of sharers per memory block using caching. In some embodiments, the multi-processor system may further include a directory subsystem configured to support varying numbers of sharers per memory block using hashing to entries with storage for different numbers of pointers to sharers. In some embodiments, the multi-processor system may further include a directory subsystem configured to use hashing to reduce storage allocated to memory blocks with zero sharers.
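
The following is a minimal sketch, under assumed parameters and not the disclosed design, of a directory that supports varying numbers of sharers per memory block: blocks with few sharers use a small pointer list, while blocks with many sharers overflow into a wider entry keyed (hashed) by block address.

    # Hypothetical two-level directory for varying sharer counts per block.
    SMALL_LIMIT = 4

    small_dir = {}      # block -> list of sharer chip IDs (up to SMALL_LIMIT)
    overflow_dir = {}   # block -> set of sharer chip IDs (wide format, hashed by block)

    def add_sharer(block, chip_id):
        if block in overflow_dir:
            overflow_dir[block].add(chip_id)
            return
        sharers = small_dir.setdefault(block, [])
        if chip_id not in sharers:
            sharers.append(chip_id)
        if len(sharers) > SMALL_LIMIT:            # promote to the wide format
            overflow_dir[block] = set(small_dir.pop(block))

    for chip in range(6):
        add_sharer(0x1000, chip)
    print(overflow_dir[0x1000])   # the block is promoted once it exceeds four sharers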

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to photonic components. Each or at least one of the processor chips may be liquid-cooled. In some embodiments, the multi-processor system may further include a cooling mechanism and a liquid coolant contained in the cooling mechanism. The liquid coolant may be in direct contact with a back side of a processor die of at least one of the processor chips. In some embodiments, the liquid coolant may change phase to a vapor as part of a heat transfer process. In some embodiments, the vaporized liquid coolant may be condensed by a heat exchanger of the cooling mechanism containing a secondary fluid. In some embodiments, the secondary fluid may be of a different type than the liquid coolant. In some embodiments, a heat flux from the processor die of at least one of the processor chips may be enhanced by impingement. For instance, the liquid coolant may be impinged on the back side of a processor die of at least one of the processor chips.

In one example implementation, a multi-processor system may include a plurality of photonic components and a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to the photonic components. Each of the processor chips may also include a voltage regulation circuit configured to regulate a voltage of one or more of the processor chips. In some embodiments, the voltage regulation circuit of each of the processor chips may provide one or more voltage domains of the respective processor chip. In some embodiments, the multi-processor system may further include one or more additional electronic components, e.g., inductors, as part of the package.

In one example implementation, a multi-processor system may include a plurality of processor chips each of which includes at least one I/O component which is designed to directly connect to photonic components. The processor chips may be packaged so that a total latency from any one of the processor chips to data at any global memory location may not be dominated by a round trip speed-of-light propagation delay. In some embodiments, the multi-processor system may include at least 10,000 processor chips and may be packaged into a total volume of no more than 8 m³. In some embodiments, a density of the processor chips may be greater than 1,000 chips per cubic meter. In some embodiments, a latency of the multi-processor system, having more than 1,000 processor chips, may be less than 200 nanoseconds (ns).
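
A back-of-the-envelope check (path-length model and fiber delay assumed) that the round-trip speed-of-light delay within an 8 m³ package is small compared with a 200 ns total latency:

    import math

    # Hypothetical worst-case round trip across the diagonal of an 8 m^3 cube.
    volume_m3 = 8.0
    side_m = volume_m3 ** (1 / 3)                 # ~2 m cube edge
    path_m = 2 * math.sqrt(3) * side_m            # generous round-trip path length
    fiber_delay_ns_per_m = 5                      # ~5 ns/m in silica fiber

    round_trip_ns = path_m * fiber_delay_ns_per_m
    print(f"Round-trip propagation ~{round_trip_ns:.0f} ns versus 200 ns total latency")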

In one example implementation, a multi-processor system may include an inter-processor interconnect (IPI) and a plurality of processor chips. The processor chips are configured to communicate data to one another through the IPI. Each of the processor chips may include one or more cores and one or more level 1 (L1) caches. Each of the L1 caches may be associated with a respective core through a respective core-cache bandwidth. Each of the processor chips may also include at least one memory controller and one or more local memory devices. Each of the local memory devices may be associated with the at least one memory controller through a respective local memory bandwidth. Each of the processor chips may further include an on-chip interconnect (OCI) that is associated with the one or more cores and the at least one memory controller of that processor chip. The OCI is also associated with the IPI of the multi-processor system. The association between the OCI and the plurality of cores of that processor chip is through a bandwidth that is greater than 50% of an aggregate core bandwidth, which is approximately the sum of each core-cache bandwidth of that processor chip. The association between the OCI and the at least one memory controller of that processor chip is through a bandwidth that is greater than 50% of an aggregate memory bandwidth, which is approximately the sum of each local memory bandwidth of that processor chip. The association between the OCI and the IPI of the multi-processor system is through an injection bandwidth. In some embodiments, the injection bandwidth is greater than 50% of the aggregate core bandwidth of that processor chip. In some embodiments, the injection bandwidth is greater than 50% of a sum of the aggregate core bandwidth and the aggregate memory bandwidth of that processor chip.

ADDITIONAL NOTES

The herein-described subject matter sometimes illustrates different components contained within, or connected with, different other components. It is to be understood that such depicted architectures are merely examples, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected”, or “operably coupled”, to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being “operably couplable”, to each other to achieve the desired functionality. Specific examples of operably couplable include but are not limited to physically mateable and/or physically interacting components and/or wirelessly interactable and/or wirelessly interacting components and/or logically interacting and/or logically interactable components.

Further, with respect to the use of substantially any plural and/or singular terms herein, those having skill in the art can translate from the plural to the singular and/or from the singular to the plural as is appropriate to the context and/or application. The various singular/plural permutations may be expressly set forth herein for sake of clarity.

Moreover, it will be understood by those skilled in the art that, in general, terms used herein, and especially in the appended claims, e.g., bodies of the appended claims, are generally intended as “open” terms, e.g., the term “including” should be interpreted as “including but not limited to,” the term “having” should be interpreted as “having at least,” the term “includes” should be interpreted as “includes but is not limited to,” etc. It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases “at least one” and “one or more” to introduce claim recitations. However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim recitation to embodiments containing only one such recitation, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an,” e.g., “a” and/or “an” should be interpreted to mean “at least one” or “one or more;” the same holds true for the use of definite articles used to introduce claim recitations. In addition, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should be interpreted to mean at least the recited number, e.g., the bare recitation of “two recitations,” without other modifiers, means at least two recitations, or two or more recitations. Furthermore, in those instances where a convention analogous to “at least one of A, B, and C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention, e.g., “a system having at least one of A, B, and C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc. In those instances where a convention analogous to “at least one of A, B, or C, etc.” is used, in general such a construction is intended in the sense one having skill in the art would understand the convention, e.g., “a system having at least one of A, B, or C” would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc. It will be further understood by those within the art that virtually any disjunctive word and/or phrase presenting two or more alternative terms, whether in the description, claims, or drawings, should be understood to contemplate the possibilities of including one of the terms, either of the terms, or both terms. For example, the phrase “A or B” will be understood to include the possibilities of “A” or “B” or “A and B.”

From the foregoing, it will be appreciated that various embodiments of the present disclosure have been described herein for purposes of illustration, and that various modifications may be made without departing from the scope and spirit of the present disclosure. Accordingly, the various embodiments disclosed herein are not intended to be limiting, with the true scope and spirit being indicated by the following claims.

What is claimed is:
 1. A multi-processor system, comprising: a plurality of processor chips, each of the processor chips comprising: a cache; and at least one input/output (I/O) component; and a plurality of photonic components, wherein the at least one I/O component of at least one of the processor chips is configured to communicate data to one or more others of the processor chips through one or more of the photonic components.
 2. The multi-processor system of claim 1, wherein each of the processor chips further comprises: an on-chip switch configured to provide a global bandwidth to an inter-processor interconnect (IPI) that connects the processor chips.
 3. The multi-processor system of claim 2, wherein the on-chip switch is configured with a radix greater than or equal to a number required so that a number of hops required to reach any other processor chip is not greater than 3.
 4. The multi-processor system of claim 1, wherein each of the processor chips further comprises: a switch configured to use a table-driven router.
 5. The multi-processor system of claim 1, wherein each of the processor chips further comprises: a switch configured to perform dynamic routing.
 6. The multi-processor system of claim 1, wherein each of the processor chips further comprises: a switch configured to use a buffer pool architecture.
 7. The multi-processor system of claim 1, further comprising: a plurality of memory devices, wherein the at least one I/O component of at least one of the processor chips is configured to communicate data with at least one of the memory devices through one or more of the photonic components.
 8. The multi-processor system of claim 7, wherein at least one of the memory devices comprises a dynamic random access memory (DRAM).
 9. The multi-processor system of claim 7, wherein at least one of the memory devices comprises a non-volatile random access memory (NVRAM).
 10. The multi-processor system of claim 9, wherein at least one of the memory devices comprises a flash memory.
 11. The multi-processor system of claim 1, wherein the at least one I/O component of at least one of the processor chips is configured to provide address information to an external device through one or more of the photonic components.
 12. The multi-processor system of claim 11, wherein the external device comprises a dynamic random access memory (DRAM) or another one of the processor chips.
 13. The multi-processor system of claim 11, further comprising: a plurality of memory devices, wherein each of the processor chips further comprises a memory management unit (MMU) configured to manage large pages such that mappings for the memory devices can be contained in a translation lookaside buffer (TLB) simultaneously.
 14. The multi-processor system of claim 13, wherein the TLB includes less than 1,000 entries.
 15. The multi-processor system of claim 13, wherein the photonic components are configured with a physical address (PA) capability for addressing the memory devices.
 16. A multi-processor system, comprising: an inter-processor interconnect (IPI); and a plurality of processor chips, each of the processor chips comprising: a plurality of cores; a plurality of level 1 (L1) caches each of which is associated with a respective one of the plurality of cores through a respective core-cache bandwidth; at least one memory controller; a plurality of local memory devices each of which is associated with the at least one memory controller through a respective local memory bandwidth; and an on-chip interconnect (OCI) that is associated with the plurality of cores, the at least one memory controller, and the IPI, wherein at least one of the processor chips is configured to communicate data to one or more others of the processor chips through the IPI.
 17. The multi-processor system of claim 16, wherein, for each of the processor chips: the OCI is associated with the plurality of cores of the processor chip through a bandwidth that is greater than 50% of an aggregate core bandwidth, the OCI is associated with the at least one memory controller of the processor chip through a bandwidth that is greater than 50% of an aggregate memory bandwidth, the OCI is associated with the IPI through an injection bandwidth, the aggregate core bandwidth is substantially a sum of each core-cache bandwidth of the processor chip, and the aggregate memory bandwidth is substantially a sum of each local memory bandwidth of the processor chip.
 18. The multi-processor system of claim 17, wherein the injection bandwidth of each of the processor chips is greater than 50% of the respective aggregate core bandwidth.
 19. The multi-processor system of claim 17, wherein the injection bandwidth of each of the processor chips is greater than 50% of a sum of the respective aggregate core bandwidth and the respective aggregate local memory bandwidth.